Senior Site Reliability Engineer

Philippines, Asia Pacific

Engineer

Full-Time

2026-03-13

2026-04-12

Skills Certificate

Key Responsibilities

24/7 Incident Command & Alerting

24/7 Availability: Participate in a shift rotation or on-call schedule to ensure continuous coverage. You are the "eyes on glass" for the organization.
Unified Alerting: Manage the notification workflow. Ensure that Critical Alerts for both Infrastructure failures and Application failures trigger immediate notifications to the 24/7 team.
Major Incident Management (MIM): Lead the technical response during critical outages. Coordinate cross-functional teams to restore service rapidly.

Observability Strategy (Dynatrace Focus)

Dynatrace Administration: Act as the Subject Matter Expert (SME) for our Dynatrace implementation.
Configure Management Zones, Alerting Profiles, and Dashboards to provide a "Single Pane of Glass."
Utilize Dynatrace PurePath for distributed tracing to identify bottlenecks in microservices.
Leverage Davis AI to automatically detect anomalies and reduce alert noise.
Comprehensive Monitoring Scope:
Network Health: Monitor VPN Tunnel status, Load Balancer (ALB/NLB) health, and DNS latency. Trigger: Alert on packet loss or high latency.
Infrastructure Health: Monitor Disk/Volume usage, CPU/Memory saturation, and SSL Certificate expiry.
Security: Monitor for DDoS attack patterns and WAF spikes.

Resilience & Chaos Engineering

Chaos Engineering: Plan and execute Chaos Engineering exercises (e.g., simulating pod failures, network latency, zone outages) to test the system's resilience and verify that failover mechanisms work as expected.
Reliability Recommendations: Proactively analyze trends and provide architectural recommendations to development and infrastructure teams to improve system stability.
First Line Troubleshooting: Serve as the L1/L2 troubleshooter for Kubernetes (EKS), AWS, and Linux issues. Execute "Quick Fix" runbooks to mitigate impact before escalating to platform engineering.

Application Triage & Analysis

Deep-Dive Triage: Go beyond basic system checks to perform deep analysis using Dynatrace. Analyze stack traces and exception logs to pinpoint the exact line of code causing the failure.
Root Cause Differentiation: Rapidly differentiate between an Infrastructure Issue (e.g., a network timeout) and an Application Logic Error (e.g., a NullPointerException caused by bad data).
Blameless RCA: Facilitate Root Cause Analysis sessions to ensure permanent fixes are applied to recurring problems.

Governance & Reporting (Stability Cadence)

Stability Calls: Facilitate and lead the Weekly/Bi-Weekly Stability Call. Present the health status of all technical towers to leadership and stakeholders.
Reporting: Generate regular reports on system uptime, error budgets, incident trends, and MTTR (Mean Time To Recovery).
Cross-Tower Visibility: Ensure that the dashboards and reports provide value to all teams (Network, App, Cloud), so that no siloed "blind spots" remain in production.

Automation & Toil Reduction

Remediation Scripting: Develop scripts (Python/Bash) to "Auto-Heal" common issues (e.g., clearing logs when disk is full, restarting stuck services).
Process Improvement: Identify manual checks and convert them into automated Dynatrace alerts or synthetic tests.

Required Qualifications

Shift Availability: Must be willing to work in a 24/7 shift environment or a strictly defined on-call rotation.
Dynatrace Expertise: Deep experience administering and using Dynatrace in a production environment (Dashboards, OneAgent, PurePaths).
Troubleshooting Expertise:
Network: Understanding of DNS, TCP/IP, Load Balancing, and Firewalls.
Compute/Storage: Understanding of block vs. object storage, CPU steal time, and memory management.
Governance: Experience facilitating technical management calls and producing executive-level reliability reports.
Application Debugging: Ability to read application logs (Java, Node, Python) to understand why a service failed.
Cloud (AWS) & K8s: Solid understanding of EKS, EC2, and other AWS services.
